```r
# KNITR MUST BE VERSION 1.42 TO RENDER MAPS

# Library Imports
library('tidyverse')
library('janitor')
library("scales")
library("sf")
library('leaflet')
library('tippy')
library('xfun')
library('ggpubr')
library('flextable')

# Needed to clean names for the inline code during introduction. More involved cleaning will be discussed.
raw_df <- readr::read_csv('Data/DATA2x02 survey (2023) (Responses) - Form responses 1.csv') |>
  janitor::clean_names()

# Adding tool tips for a few key terms
tippy::tippy_this(elementId = "random_sample", tooltip = "When all members of a population have equal likelihood to be sampled.")
tippy::tippy_this(elementId = "wam", tooltip = "Weighted Average Mark")
```
DATA2X02 is a group of two units – DATA2002 and DATA2902 – offered within the School of Mathematics and Statistics at The University of Sydney. The units teach “advanced data analytic skills for a wide range of problems and data” (The University of Sydney 2023) with a focus on statistical methods to analyse and answer a scientific question.
1.1 Survey Method and Random Sampling
The raw dataset provided was sourced from a cohort survey which aimed to gain insight into the units’ cohort. Despite efforts to encourage student participation in the survey through an Ed Discussion Announcement and multiple reminders in labs and lectures, the response rate was 41%. It is important to note that due to this method of communication, there exists an argument that the survey participants may not have been a random sample of DATA2X02 students.
Students who were less engaged – possibly not attending lectures, labs, or interacting with the Ed Discussion Board – are considerably less likely to have completed the survey compared to their counterparts who received multiple prompts. Moreover, those who are more engaged are likely to take time out of their day to fill out the survey after a reminder. This is evidenced by DATA2902 (the advanced stream of DATA2X02) having a response rate of 71% compared to DATA2002’s rate of 37%. Students could also submit the survey multiple times, which may have skewed the data towards individuals who submitted multiple responses, but EDA showed that this did not occur in a major way.
Whilst acknowledging these shortcomings of the sampling method and subsequent response pattern, it is taken that the survey still offers a moderately random sample of the DATA2X02 cohort and that responses were (for the most part) independent from one another. For more detailed analysis, a new dataset should be sourced from a different surveying method, ensuring that more students submit the survey and restricting responses to one per person.
1.2 Sources of Bias
There are some potential biases that may have occurred during this survey.
Non-response Bias – As discussed in Section 1.1, there may have been a non-response bias within the survey. Specifically, we see a difference in response rates between DATA2902 and DATA2002 students. This may have skewed the sample data towards the population of DATA2902 students, rather than DATA2X02 as a whole. This would be an issue if there is a major difference between the populations of the two units. This is not out of the question, as those who opt to take an advanced stream of a unit may be more willing to challenge themselves and put more effort into their studies. Moreover, there is the possibility that students do not opt for an advanced unit in order to prioritise other aspects of their lives, such as work.
Social desirability/conformity bias – Many of the questions asked in the survey have an associated ‘socially desirable’ response. For example, students may, whether consciously or unconsciously, overestimate the number of hours they exercise, or underestimate the amount of time they spend on social media, as these answers come with positive social connotations. Moreover, students may want to conform to the expected answer of the population. An example of this may be the question of whether or not students had experience in R coding. The majority of the DATA2X02 cohort would have had experience in R as it was taught in many prerequisite courses, so those who didn’t have experience may have answered incorrectly to conform with the rest of the cohort.
Recall Bias – Even if students did not suffer from social desirability or conformity bias, they may have simply not been able to recall the correct answer to a question. An example of this would be someone’s WAM. Many students may not know their actual WAM (as it is not reported when getting results or on the online academic transcript), and so they could incorrectly recall it when answering the survey. An instance of this is seen in the WAMs reported, with three students reporting a WAM of 99 or above, a value that could potentially be less accurate due to difficulties in recall or a deliberate distortion.
1.3 Possible Improvements
Several improvements to the survey would help to generate more useful data. Many of the questions regarding numeric data did not specify the units in which an answer should be given, or whether the units should be included in the answer. This can be addressed by specifying units in the question and only allowing numeric input rather than free text. One such question was `How much sleep do you get (on average, per day)?`. A better wording of this question would be `How much sleep (in hours per night) do you get on average?`. This was also an issue for the question `How tall are you?`, where answers were not given in a uniform manner. Rewording to `How tall are you in cm?` would have produced data that required much less cleaning. This extends to `What is your shoe size?`, where students responded with both US and European shoe sizes, which are on very different scales (a US 10 is a 43 European).
There were also issues regarding the categorical data. The question `Would you prefer to study at Fisher Library or SciTech Library?` did not need to include an `Other` response, as any answer of this type would not be answering the question asked. Moreover, the question `Do you work?` did not align with the suggested responses given. This question should have been `What is your current employment status?`. A similar issue was seen in the question `Do you submit assignments on time?`, which should have been `How often do you submit assignments on time?`. Finally, some questions could have included some options and an `Other` response, rather than free text. This was a particular issue for `What brand is your laptop?` and `What is your favourite social media platform?`, where students gave answers in many different forms when referring to the same category, e.g. Apple and Macbook being the same laptop brand. Providing some pre-defined answers would reduce the need for data cleaning.
1.4 Report Outline
This report will focus on the geographical characteristics of the cohort, with the Postcode of each response being used as a proxy for where a student lives. Specifically, hypothesis testing will be used to determine the impact of a student’s geographical region on a variety of variables.
SA4s are the “largest sub-State regions” and “represent labour markets or groups of labour markets within each State and Territory” (Australian Bureau of Statistics 2021), with each SA4 having approximately 300,000 - 500,000 residents in metropolitan areas. These regions will be used to group students into geographical areas with ‘geographical, social and economic similarities’ (Australian Bureau of Statistics 2021). Figure 1 is a map made using Leaflet (Cheng, Karambelkar, and Xie 2023) which showcases the SA4s of Greater Sydney.
A variety of data cleaning has been done in R (R Core Team 2023) and RStudio (RStudio Team 2020) utilising the tidyverse packages (Wickham et al. 2019). The janitor package (Firke 2023) was initially used to help standardise the names of each column so that a reproducible introduction could be made. A new naming convention for the columns was adopted based on Tarr (2023). Some summary tables have also been created using gt (Iannone et al. 2023).
The SA4 name of each respondent was joined to the survey data using a reference table made by Proctor (2023). The HTML Table on the website (Proctor 2023) was converted into a CSV file for easier manipulation (Data Design Group 2023).
```r
# Read the postcode-to-SA4 lookup and keep the distinct postcode/SA4 pairs
sa4_postcode_df <- readr::read_csv('Data/sa4_postcode.csv') |>
  select(c(`Postcode`, `SA4 Name`)) |>
  unique() |>
  # Drop the Southern Highlands and Shoalhaven pairing for postcode 2232
  filter(!((`Postcode` == 2232) & (`SA4 Name` == 'Southern Highlands and Shoalhaven')))
colnames(sa4_postcode_df) <- c('post_code', 'sa4_name')
sa4_postcode_df$post_code <- as.character(sa4_postcode_df$post_code)

# Strip non-digit characters from the survey postcodes, then join on post_code
df$post_code <- as.character(gsub("[^0-9]", "", df$post_code))
df <- df |> left_join(sa4_postcode_df, by = 'post_code')

# Summary table of respondents per SA4
df |>
  count(sa4_name) |>
  arrange(desc(n)) |>
  gt::gt() |>
  gt::cols_label(sa4_name = "SA4 Name", n = 'Count of Students') |>
  gt::tab_header(title = "Count of Students by SA4") |>
  gt::tab_options(heading.title.font.weight = 'bolder', column_labels.font.weight = 'bold')
```
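The postcode clean-up and join can be illustrated on a toy example. The sketch below uses base R only: the `survey` and `lookup` data frames and all of their values are hypothetical, and `merge(..., all.x = TRUE)` stands in for `left_join()`. It shows why a postcode listed under two SA4s is best de-duplicated before joining, since otherwise the matching respondent appears twice.

```r
# Hypothetical survey responses with messy postcodes (illustrative values only)
survey <- data.frame(
  id = 1:3,
  post_code = c("2232", " 2006", "2050 NSW"),
  stringsAsFactors = FALSE
)

# Keep digits only, mirroring gsub("[^0-9]", "", ...) in the report
survey$post_code <- gsub("[^0-9]", "", survey$post_code)

# Hypothetical lookup in which one postcode is listed under two SA4s
lookup <- data.frame(
  post_code = c("2232", "2232", "2006", "2050"),
  sa4_name = c("Sydney - Sutherland", "Southern Highlands and Shoalhaven",
               "Sydney - City and Inner South", "Sydney - City and Inner South"),
  stringsAsFactors = FALSE
)

# Joining against the raw lookup duplicates respondent 1
joined_raw <- merge(survey, lookup, by = "post_code", all.x = TRUE)

# Dropping the unwanted pairing first keeps one row per respondent
keep <- !(lookup$post_code == "2232" &
            lookup$sa4_name == "Southern Highlands and Shoalhaven")
joined_clean <- merge(survey, lookup[keep, ], by = "post_code", all.x = TRUE)

nrow(joined_raw)    # 4
nrow(joined_clean)  # 3
```

Comparing `nrow()` before and after a join is a cheap check that the lookup has at most one row per key.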
The SA4s were further grouped together geographically to collapse some of the groups with lower student counts. Figure 4 is a map of the groupings of SA4s into regions. A conversion table was generated using flextable (Gohel and Skintzos 2023).
SA4 to Region Conversion Table
```r
# SA4s that make up each named region
north_sydney <- c('Sydney - North Sydney and Hornsby', 'Sydney - Ryde', 'Sydney - Northern Beaches')
city_and_eastern_suburbs <- c('Sydney - City and Inner South', 'Sydney - Eastern Suburbs')
inner_west <- c('Sydney - Inner West', 'Sydney - Parramatta', 'Sydney - Inner South West')

# Any other known SA4 falls into the catch-all region; unknown SA4s stay NA
df <- df |>
  mutate(geographic_regions = case_when(
    sa4_name %in% north_sydney ~ 'North Sydney',
    sa4_name %in% city_and_eastern_suburbs ~ 'City and Eastern Suburbs',
    sa4_name %in% inner_west ~ 'Inner West',
    !is.na(sa4_name) ~ 'Outer South West, Greater Sydney and Regional NSW',
    TRUE ~ NA_character_
  ))

# Region-to-SA4 conversion table
mapping_df <- df |>
  select(geographic_regions, sa4_name) |>
  unique() |>
  drop_na() |>
  arrange(geographic_regions) |>
  mutate(`Region` = geographic_regions, `SA4 Name` = sa4_name) |>
  select(Region, `SA4 Name`)

flextable(mapping_df) |>
  merge_v() |>
  theme_vanilla() |>
  width(2, 4) |>
  width(1, 2)
```
A flagging column was made that identified if someone travelled to the university by car.
```r
# Flag respondents whose travel method mentions a car; missing answers stay NA
df <- df |>
  mutate(car_flag = ifelse(str_detect(uni_travel_method, "Car"), "Drive",
                           ifelse(is.na(uni_travel_method), NA, "Other")))

df |>
  count(car_flag) |>
  gt::gt() |>
  gt::cols_label(car_flag = "Does the Student Drive to University?", n = 'Count of Students') |>
  gt::tab_header(title = "Count of Students by Whether or Not they Travel by Car") |>
  gt::tab_options(heading.title.font.weight = 'bolder', column_labels.font.weight = 'bold')
```
Count of Students by Whether or Not they Travel by Car
The employment hours of each respondent were binned into categories of \(0\) hours, \(1-10\) hours, and \(11+\) hours. This was done to provide a more numerical view of a respondent’s employment status: two respondents with the same employment status may work very different hours, so the bins give a better understanding of how much a respondent works. Bins were also preferred to raw hours because a large share (45%) of respondents worked no hours, which pulled the group means towards \(0\).
```r
# With include.lowest = TRUE these breaks give [0, 1], (1, 10.5] and (10.5, Inf]
bin_ranges <- c(0, 1, 10.5, Inf)
bin_labels <- c("0", "1-10", "11+")

# Create a new column with binned values
df$employment_hrs_bin <- cut(df$employment_hrs, breaks = bin_ranges, labels = bin_labels, include.lowest = TRUE)

df |>
  count(employment_hrs_bin) |>
  gt::gt() |>
  gt::cols_label(employment_hrs_bin = "Employment Hours", n = 'Count of Students') |>
  gt::tab_header(title = "Count of Students by Employment Hours") |>
  gt::tab_options(heading.title.font.weight = 'bolder', column_labels.font.weight = 'bold')
```
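To see how `cut()` assigns boundary values with these breaks, the sketch below bins a few hypothetical hour values. The `alt_bins` breaks are an assumption offered for comparison only, not the report's choice; they place a respondent working exactly one hour in the "1-10" bin rather than the "0" bin.

```r
hours <- c(0, 1, 2, 10, 11)  # hypothetical weekly hours
bin_labels <- c("0", "1-10", "11+")

# Breaks used in the report: intervals are [0, 1], (1, 10.5], (10.5, Inf]
report_bins <- cut(hours, breaks = c(0, 1, 10.5, Inf),
                   labels = bin_labels, include.lowest = TRUE)

# Alternative breaks (an assumption): (-Inf, 0], (0, 10.5], (10.5, Inf]
alt_bins <- cut(hours, breaks = c(-Inf, 0, 10.5, Inf), labels = bin_labels)

data.frame(hours, report_bins, alt_bins)
```

The two schemes differ only for values in (0, 1], so the reported counts would change only if some respondents reported working more than zero and up to one hour per week.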
Outlying WAM values were set to NA, as these may come from international students who use a different WAM system or from respondents who do not know their WAM. A value was treated as an outlier if it lay more than \(3\) standard deviations from the mean, which is a common rule for removing outliers. Figure 7 shows a histogram of respondents’ WAM.
Figure 7: Histogram of students’ WAM with outliers removed (Wickham 2016)
In all hypothesis tests, the NA values of the columns investigated were removed as these either represented answers that were not filled out from the respondent, or values which were deemed outliers.
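The \(\pm 3\) standard deviation rule used for WAM can be sketched as follows. This is a minimal illustration rather than the report's own code: the helper `mask_outliers()` and the WAM values are hypothetical.

```r
# Replace values more than k standard deviations from the mean with NA
mask_outliers <- function(x, k = 3) {
  centre <- mean(x, na.rm = TRUE)
  spread <- sd(x, na.rm = TRUE)
  x[!is.na(x) & abs(x - centre) > k * spread] <- NA
  x
}

# Hypothetical WAM values: 30 plausible marks plus one implausible entry
wam <- c(rep(c(65, 70, 75, 80, 85), times = 6), 7)
cleaned <- mask_outliers(wam)

sum(is.na(cleaned))  # number of values flagged as outliers
```

Because the mean and standard deviation are computed with the outliers included, a single pass can miss outliers when several extreme values inflate the standard deviation.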
2 Hypothesis Testing
2.1 Does living in Sydney’s City and Eastern Suburbs influence if students drive to university?
As the University of Sydney is located in Sydney’s City and Eastern Suburbs, it is suspected that students may opt for the use of public transport, rather than driving to the university, if they live close to the university. This is of interest as effective carbon emissions of the University can be reduced if more students use public transport. This is also suggested from EDA which showed uneven proportions between the two groups, as seen in Figure 8.
```r
# Collapse regions to City and Eastern Suburbs vs Other and drop missing values
car_df <- df |>
  select(c(geographic_regions, car_flag)) |>
  mutate(geographic_regions = ifelse(geographic_regions == 'City and Eastern Suburbs', 'City and Eastern Suburbs', 'Other')) |>
  drop_na() |>
  mutate(`Travel Method` = car_flag)

# Proportion bar chart of travel method within each region
car_df |>
  ggplot() +
  aes(x = geographic_regions, fill = `Travel Method`) +
  geom_bar(colour = "black", linewidth = 0.5, position = "fill") +
  labs(y = "Proportion of Travel Method", x = "Region",
       title = "Proportion of Students who drive to University \n based on Geographical Location",
       fill = "Travel Method") +
  theme(plot.background = element_rect(fill = "#ffffff", linewidth = 0),
        legend.background = element_rect(fill = "#ffffff", linewidth = 0),
        panel.border = element_rect(colour = "black", fill = NA),
        legend.box.background = element_rect(colour = "black"),
        axis.title = element_text(face = "bold"),
        plot.title = element_text(face = "bold", size = 14, hjust = 0.5)) +
  scale_y_continuous(labels = scales::percent) +
  scale_fill_brewer(palette = "Set2")
```
Figure 8: Proportion bar chart of travel method for different regions (Wickham 2016)
A \(\chi^2\)-test for independence was performed at the \(\alpha = 0.05\) level on the contingency table below. A Monte Carlo simulation of \(6000\) resampled tables was used to approximate the distribution of the test statistic and to calculate the \(p\)-value.
```r
# Cross-tabulate region against travel method and display with gt
contingency_table <- table(car_df$geographic_regions, car_df$car_flag) |>
  as.data.frame.matrix()
contingency_table$`Region` <- c('City and Eastern Suburbs', 'Other')

contingency_table |>
  gt::gt() |>
  gt::cols_move_to_start(columns = c(`Region`)) |>
  gt::tab_spanner(label = "Method of Travel", columns = 1:2) |>
  gt::tab_header(title = "Count of Students by Method of Travel") |>
  gt::tab_options(heading.title.font.weight = 'bolder', column_labels.font.weight = 'bold')
```
Count of Students by Method of Travel

| Region | Method of Travel: Drive | Method of Travel: Other |
|---|---|---|
| City and Eastern Suburbs | 13 | 118 |
| Other | 48 | 101 |
Figure 9: Contingency Table of Region and Method of Travel (Iannone et al. 2023)
Hypothesis – \(H_0\): The method of travel of a student is independent of whether or not they live in Sydney’s City and Eastern Suburbs. \(H_1\): There is some interdependence between a student’s method of travel and whether or not they live in Sydney’s City and Eastern Suburbs.
Assumptions – The test statistic is a good measure of departure from independence, and the resampling process provides a good approximation to its distribution under \(H_0\). These assumptions are reasonable because, as the number of simulated tables grows, the simulated distribution approaches the null distribution of the \(\chi^2\) test statistic. The observations are also assumed to be independent. Despite some concerns raised in Section 1.1, the observations are likely to be independent from each other.
p-value – The proportion of simulated test statistics that were as or more extreme than \(t_0\) was \(p=\) 0.00017.
Decision – As the \(p\)-value was \(<\alpha\), we reject \(H_0\). This implies that there is some interdependence between the method of travel and whether or not a student lives in Sydney’s City and Eastern Suburbs.
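The Monte Carlo test in this section can be reproduced from the contingency table in Figure 9 alone. The sketch below is not the report's original chunk: the seed is an arbitrary assumption, so the simulated \(p\)-value may differ slightly from run to run with other seeds.

```r
# Observed counts from Figure 9 (rows: region, columns: travel method)
travel <- matrix(c(13, 118,
                   48, 101),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(Region = c("City and Eastern Suburbs", "Other"),
                                 Travel = c("Drive", "Other")))

set.seed(2023)  # arbitrary seed, used only for reproducibility
mc_test <- chisq.test(travel, simulate.p.value = TRUE, B = 6000)

mc_test$statistic  # observed test statistic t_0
mc_test$p.value    # Monte Carlo p-value
```

`chisq.test()` reports the simulated \(p\)-value as one plus the number of simulated statistics at least as large as \(t_0\), divided by \(B + 1\). With \(B = 6000\) the smallest attainable value is therefore \(1/6001 \approx 0.00017\), which matches the \(p\)-value reported above.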
2.2 Is academic performance significantly better for students living in North Sydney compared to those in the Inner West?
A student’s WAM is one measure of academic performance. Knowing if WAM is impacted by where students live could be useful, as it could allow the University to provide targeted academic support.
A Welch two-sample one-sided \(t\)-test at the \(\alpha = 0.05\) level was conducted to determine if the mean WAM of students in North Sydney is larger than that of students living in the Inner West. This test is less powerful than the equal-variance two-sample \(t\)-test, but forgoes the assumption of equal variance. It was chosen because the sample variance of WAM was 88.8 for North Sydney compared to 101.5 for the Inner West.
Initial EDA suggests this may be the case, with mean WAMs of 76.4 for North Sydney and 74.1 for the Inner West. We can also generate a QQ-plot of students’ WAM, which suggests the variable is approximately normally distributed, as the points lie close to a straight line.
Figure 12: Histogram of WAMs of students from the Inner West and North Sydney (Wickham 2016)
Welch two-sample one-sided \(t\)-test
Hypothesis – \(H_0\): The mean WAM of students from North Sydney \(\mu_{NS}\) equals the mean WAM of students from the Inner West \(\mu_{IW}\). \(H_1\): \(\mu_{NS}\) is greater than \(\mu_{IW}\).
Assumptions – The observations within each group are independent and identically distributed as \(\mathcal{N}(\mu_{i}, \sigma_{i}^2)\) for \(i=NS, IW\), and the two groups are independent of each other. Despite some concerns raised in Section 1.1, the groups are likely to be independent from each other, although a student may have responded twice with different values of WAM or Postcode, which would violate this. The above QQ-plot (Figure 11) shows that the WAM is likely to be normally distributed. Moreover, using a Shapiro-Wilk test, both groups were consistent with \(X\sim\mathcal{N}(\mu_{i}, \sigma_{i}^2)\), with \(p\)-values of 0.223 for Inner West and 0.273 for North Sydney.
Test Statistic – \[T=\frac{\overline{NS}-\overline{IW}}{\sqrt{\frac{S_{ns}^2}{n_{ns}}+\frac{S_{iw}^2}{n_{iw}}}}\] Here, \(S_{ns}^2\) and \(S_{iw}^2\) are the sample variance of the \(NS\) (North Sydney) and \(IW\) (Inner West) samples. Under \(H_0\), \(T\sim t_{\nu}\), where \(\nu=\) 104.47 as estimated from the data.
Decision – As the \(p\) -value was \(>\alpha\), we cannot reject \(H_0\) and say the data is consistent with the mean WAM of students in North Sydney and the Inner West being equal.
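The Welch statistic and its degrees of freedom \(\nu\) can be checked by hand against `t.test()`. The sketch below uses simulated stand-ins for the two WAM samples, not the survey data; the sample sizes, means, standard deviations and seed are all assumptions chosen for illustration.

```r
set.seed(1)  # arbitrary seed for reproducibility
# Simulated stand-ins for the two WAM samples (NOT the survey data)
ns <- rnorm(55, mean = 76, sd = 9.5)
iw <- rnorm(60, mean = 74, sd = 10)

# Welch statistic and Welch-Satterthwaite degrees of freedom by hand
v_ns <- var(ns) / length(ns)
v_iw <- var(iw) / length(iw)
t_obs <- (mean(ns) - mean(iw)) / sqrt(v_ns + v_iw)
nu <- (v_ns + v_iw)^2 /
  (v_ns^2 / (length(ns) - 1) + v_iw^2 / (length(iw) - 1))
p_manual <- pt(t_obs, df = nu, lower.tail = FALSE)  # one-sided, H_1: mu_NS > mu_IW

# Same test via t.test(); var.equal = FALSE requests Welch's test
welch <- t.test(ns, iw, alternative = "greater", var.equal = FALSE)

c(manual = t_obs, t.test = unname(welch$statistic))
c(manual = nu, t.test = unname(welch$parameter))
c(manual = p_manual, t.test = welch$p.value)
```

The non-integer \(\nu\) comes from the Welch-Satterthwaite approximation, which is why the degrees of freedom in the test statistic above are described as estimated from the data.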
2.3 Does a student’s Region have a significant influence on how many hours they work?
Initial exploration of the data set suggested that the distribution of working hours differed across regions. The proportion of students working between 1 and 10 hours per week was relatively similar, and the main differences were observed when comparing the proportion of students working no hours or 11 or more hours a week.
```r
# Keep region and binned employment hours, dropping missing values
employment_df <- df |>
  select(c(geographic_regions, employment_hrs_bin)) |>
  drop_na() |>
  mutate(`Employment Hours per Week` = employment_hrs_bin)

# Proportion bar chart of hours-worked category within each region
employment_df |>
  ggplot() +
  aes(x = geographic_regions, fill = `Employment Hours per Week`) +
  geom_bar(colour = "black", linewidth = 0.5, position = "fill") +
  labs(y = "Proportion of Hours Worked Category", x = "Region",
       title = "Proportion of Students in Hours Worked by Region",
       fill = "Employment Hours per Week") +
  theme(plot.background = element_rect(fill = "#ffffff", linewidth = 0),
        legend.background = element_rect(fill = "#ffffff", linewidth = 0),
        panel.border = element_rect(colour = "black", fill = NA),
        legend.box.background = element_rect(colour = "black"),
        axis.title = element_text(face = "bold"),
        plot.title = element_text(face = "bold", size = 14, hjust = 0.5)) +
  scale_y_continuous(labels = scales::percent) +
  scale_x_discrete(labels = c("City and \n Eastern Suburbs", "Inner West", "North Sydney",
                              "Outer South West, \n Greater Sydney and \n Regional NSW")) +
  scale_fill_brewer(palette = "Set2")
```
Figure 13: Proportion bar chart of hours worked for different regions (Wickham 2016)
A \(\chi^2\)-test for independence was performed at the \(\alpha = 0.05\) level on the contingency table below. No continuity correction was applied, as `chisq.test()` only uses Yates’s correction for \(2 \times 2\) tables.
```r
# Cross-tabulate region against hours-worked category and display with gt
contingency_table <- table(employment_df$geographic_regions, employment_df$employment_hrs_bin) |>
  as.data.frame.matrix()
contingency_table$`Region` <- c('City and Eastern Suburbs', 'Inner West', 'North Sydney',
                                'Outer South West,\n Greater Sydney and \n Regional NSW')

contingency_table |>
  gt::gt() |>
  gt::cols_move_to_start(columns = c(`Region`)) |>
  gt::tab_spanner(label = "Hours Worked", columns = 1:3) |>
  gt::tab_header(title = "Count of Students by Hours Worked") |>
  gt::tab_options(heading.title.font.weight = 'bolder', column_labels.font.weight = 'bold')
```
Count of Students by Hours Worked

| Region | Hours Worked: 0 | Hours Worked: 1-10 | Hours Worked: 11+ |
|---|---|---|---|
| City and Eastern Suburbs | 82 | 27 | 19 |
| Inner West | 34 | 15 | 14 |
| North Sydney | 15 | 15 | 27 |
| Outer South West, Greater Sydney and Regional NSW | 6 | 8 | 11 |
Figure 14: Contingency Table of Region and Hours Worked (Iannone et al. 2023)
```r
# Chi-squared test of independence between region and hours-worked category
test <- chisq.test(table(employment_df$geographic_regions, employment_df$employment_hrs_bin))
degrees_of_freedom <- test$parameter
```
\(\chi^2\)-test for independence
Hypothesis – \(H_0\): The number of hours worked by a student is independent of their region. \(H_1\): There is some interdependence between the number of hours worked and region.
Assumptions – The observations are independent, and the expected cell counts are all greater than or equal to 5. Despite some concerns raised in Section 1.1, the observations are likely to be independent from each other. This is, however, a limitation of the test, as the assumption is hard to verify absolutely. No expected cell count was less than 5, so that assumption holds.
Test Statistic – \[T = \sum_{i=1}^3 \sum_{j=1}^4 \frac{\left(Y_{i j}-e_{i j}\right)^2}{e_{i j}}\] Under \(H_0\), \(T\sim \chi^2_{6}\).
Observed Test Statistic – \(t_0=\) 35.82
p-value – \(p=P(\chi^2_{6} \geq t_0)<0.0001\).
Decision – As the \(p\) -value was \(<\alpha\), we can reject \(H_0\). This implies that there is some interdependence between hours worked in a week and a student’s region.
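The test in this section can be re-run directly from the counts in Figure 14, which also makes the expected cell counts \(e_{ij}\) easy to inspect. This is a sketch built from the published table rather than the report's original chunk, which works from `employment_df`.

```r
# Observed counts from Figure 14 (rows: region, columns: hours worked)
hours <- matrix(c(82, 27, 19,
                  34, 15, 14,
                  15, 15, 27,
                   6,  8, 11),
                nrow = 4, byrow = TRUE,
                dimnames = list(
                  Region = c("City and Eastern Suburbs", "Inner West", "North Sydney",
                             "Outer South West, Greater Sydney and Regional NSW"),
                  Hours = c("0", "1-10", "11+")))

test <- chisq.test(hours)

round(test$expected, 1)  # expected counts e_ij under H_0
min(test$expected)       # checks the expected-count assumption (>= 5)
test$statistic           # observed test statistic t_0
test$parameter           # degrees of freedom: (4 - 1) * (3 - 1)
test$p.value
```

The expected counts are the row total times the column total divided by the overall total, so the smallest one belongs to the smallest region and the least common hours category.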
3 Conclusion
The geographic characteristics have been investigated in this report by grouping DATA2X02 students into regions and performing hypothesis tests on differing variables.
Throughout the analysis, it was seen that geographic regions played a statistically significant role in the distribution of Travel Method and Employment Hours per Week. Specifically, it was found that the method of travel of students is dependent on whether or not they live in Sydney’s City and Eastern Suburbs, and employment hours per week were dependent on region. There was no statistically significant evidence to suggest that the mean WAM of students from North Sydney is greater than that of students living in the Inner West.
Future investigation into DATA2X02 cohorts may look to validate these results (to see if they are consistent with all DATA2X02 cohorts or just the 2023 cohort), as well as source more specific geographical information about students, rather than using their postcode. An example of this could be using a respondent’s address to find the straight-line distance from their residence to the University, and investigating how this may influence students’ answers. If this were to happen, stricter data security would be required, as a respondent’s address could be considered too personal to be released to every DATA2X02 student.
Cheng, Joe, Bhaskar Karambelkar, and Yihui Xie. 2023. Leaflet: Create Interactive Web Maps with the JavaScript ’Leaflet’ Library. https://CRAN.R-project.org/package=leaflet.
R Core Team. 2023. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.
RStudio Team. 2020. RStudio: Integrated Development Environment for R. Boston, MA: RStudio, PBC. http://www.rstudio.com/.
Wickham, Hadley. 2016. Ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York. https://ggplot2.tidyverse.org.
Wickham, Hadley, Mara Averick, Jennifer Bryan, Winston Chang, Lucy D’Agostino McGowan, Romain François, Garrett Grolemund, et al. 2019. “Welcome to the tidyverse.” Journal of Open Source Software 4 (43): 1686. https://doi.org/10.21105/joss.01686.
---
title: "A Statistical Investigation into the Geographic Characteristics of the DATA2X02 Cohort"
date: "`r format(Sys.time(), '%d %B, %Y')`"
author: "SID: 520468531"
title-block-banner: "#d85f33"
format:
  html:
    self_contained: true
    theme:
      - style.scss
      - united
    embed-resources: true
    code-fold: true
    code-tools: true
table-of-contents: true
number-sections: true
bibliography: bibliography.bib
page-layout: full
sidebar-width: 0px
fig-align: center
css: style.css
---

# Introduction

```{r message=FALSE}
# KNITR MUST BE VERSION 1.42 TO RENDER MAPS

# Library Imports
library('tidyverse')
library('janitor')
library("scales")
library("sf")
library('leaflet')
library('tippy')
library('xfun')
library('ggpubr')
library('flextable')

# Needed to clean names for the inline code during introduction. More involved cleaning will be discussed.
raw_df <- readr::read_csv('Data/DATA2x02 survey (2023) (Responses) - Form responses 1.csv') |>
  janitor::clean_names()

# Adding tool tips for a few key terms
tippy::tippy_this(elementId = "random_sample", tooltip = "When all members of a population have equal likelihood to be sampled.")
tippy::tippy_this(elementId = "wam", tooltip = "Weighted Average Mark")
```

DATA2X02 is a group of two units -- [DATA2002](https://www.sydney.edu.au/units/DATA2002) and [DATA2902](https://www.sydney.edu.au/units/DATA2902) -- offered within the School of Mathematics and Statistics at The University of Sydney. The units teach "advanced data analytic skills for a wide range of problems and data" [@DATA2902] with a focus on statistical methods to analyse and answer a scientific question.

## Survey Method and Random Sampling {#sec-rs}

The [raw dataset](https://docs.google.com/spreadsheets/d/e/2PACX-1vR9Ve_Zi-dM5K96ku2tnmaMrpX3Gk3y9KHcYsNIkzyyna8tRWOxBt_iDsZI_UzXFHLidPU6vY9bml4n/pub?output=csv) provided was sourced from a [cohort survey](https://pages.github.sydney.edu.au/DATA2002/2023/extra/DATA2x02_survey_2023.pdf) which aimed to gain insight into the units' cohort.
Despite efforts to encourage student participation in the survey through an Ed Discussion Announcement and multiple reminders in labs and lectures, the response rate was `r scales::percent(nrow(raw_df)/759)`.
It is important to note that due to this method of communication, there exists an argument that the survey participants may not have been a [[random sample]{style="text-decoration: underline;"}]{id='random_sample'} of DATA2X02 students.

Students who were less engaged -- possibly not attending lectures, labs, or interacting with the Ed Discussion Board -- are considerably less likely to have completed the survey compared to their counterparts who received multiple prompts. Moreover, those who are more engaged are likely to take time out of their day to fill out the survey after a reminder. This is evidenced by DATA2902 (the advanced stream of DATA2X02) having a response rate of `r scales::percent(nrow(filter(raw_df, which_unit_are_you_enrolled_in == 'DATA2902'))/84)` compared to DATA2002's rate of `r scales::percent(nrow(filter(raw_df, which_unit_are_you_enrolled_in == 'DATA2002'))/675)`. Students could also submit the survey multiple times, which may have skewed the data towards individuals who submitted multiple responses, but EDA showed that this did not occur in a major way.

Whilst acknowledging these shortcomings of the sampling method and subsequent response pattern, it is taken that the survey still offers a moderately random sample of the DATA2X02 cohort and that responses were (for the most part) independent from one another.
For more detailed analysis, a new dataset should be sourced from a different surveying method, ensuring that more students submit the survey and restricting responses to one per person.

## Sources of Bias

There are some potential biases that may have occurred during this survey.

- **Non-response Bias** -- As discussed in @sec-rs, there may have been a non-response bias within the survey.
  Specifically, we see a difference in response rates between DATA2902 and DATA2002 students. This may have skewed the sample data towards the population of DATA2902 students, rather than DATA2X02 as a whole.
  This would be an issue if there is a major difference between the populations of the two units. This is not out of the question, as those who opt to take an advanced stream of a unit may be more willing to challenge themselves and put more effort into their studies.
  Moreover, there is the possibility that students do not opt for an advanced unit in order to prioritise other aspects of their lives, such as work.

- **Social desirability/conformity bias** -- Many of the questions asked in the survey have an associated 'socially desirable' response. For example, students may, whether consciously or unconsciously, overestimate the number of hours they exercise, or underestimate the amount of time they spend on social media, as these answers come with positive social connotations. Moreover, students may want to conform to the expected answer of the population.
  An example of this may be the question of whether or not students had experience in R coding. The majority of the DATA2X02 cohort would have had experience in R as it was taught in many prerequisite courses, so those who didn't have experience may have answered incorrectly to conform with the rest of the cohort.

- **Recall Bias** -- Even if students did not suffer from social desirability or conformity bias, they may have simply not been able to recall the correct answer to a question. An example of this would be someone's WAM.
  Many students may not know their actual WAM (as it is not reported when getting results or on the online academic transcript), and so they could incorrectly recall it when answering the survey. An instance of this is seen in the WAMs reported, with `r numbers_to_words(raw_df |> filter(what_is_your_wam >= 99) |> nrow())` students reporting a WAM of 99 or above, a value that could potentially be less accurate due to difficulties in recall or a deliberate distortion.

## Possible Improvements

Several improvements to the survey would help to generate more useful data. Many of the questions regarding numeric data did not specify the units in which an answer should be given, or whether the units should be included in the answer. This can be addressed by specifying units in the question and only allowing numeric input rather than free text. One such question was `How much sleep do you get (on average, per day)?`. A better wording of this question would be `How much sleep (in hours per night) do you get on average?`. This was also an issue for the question `How tall are you?`, where answers were not given in a uniform manner. Rewording to `How tall are you in cm?` would have produced data that required much less cleaning. This extends to `What is your shoe size?`, where students responded with both US and European shoe sizes, which are on very different scales (a US 10 is a 43 European).

There were also issues regarding the categorical data. The question `Would you prefer to study at Fisher Library or SciTech Library?` did not need to include an `Other` response, as any answer of this type would not be answering the question asked. Moreover, the question `Do you work?` did not align with the suggested responses given. This question should have been `What is your current employment status?`. A similar issue was seen in the question `Do you submit assignments on time?`, which should have been `How often do you submit assignments on time?`.
Finally, some questions could have included some options and an `Other` response, rather than free text. This was a particular issue for `What brand is your laptop?` and `What is your favourite social media platform?`, where students gave answers in many different forms when referring to the same category, e.g. Apple and Macbook being the same laptop brand. Providing some pre-defined answers would reduce the need for data cleaning.

## Report Outline

This report will focus on the geographical characteristics of the cohort, with the `Postcode` of each response being used as a proxy for where a student lives. Specifically, hypothesis testing will be used to determine the impact of a student's geographical region on a variety of variables.

SA4s are the "largest sub-State regions" and "represent labour markets or groups of labour markets within each State and Territory" [@ABS2021], with each SA4 having approximately 300,000 - 500,000 residents in metropolitan areas. These regions will be used to group students into geographical areas with 'geographical, social and economic similarities' [@ABS2021].
@fig-sa4-map is a map made using `Leaflet` [@leaflet2023] which shows the SA4s of Greater Sydney^[Shape files used in this map are available [here](https://www.abs.gov.au/AUSSTATS/subscriber.nsf/log?openagent&1270055001_sa4_2016_aust_midmif.zip&1270.0.55.001&Data%20Cubes&7512AFCD3D8FED2DCA257FED001451F6&0&July%202016&12.07.2016&Latest) [@ABS2016]].

```{r warning=FALSE, results=FALSE}
# Read the ABS SA4 shapefile and keep only the Greater Sydney regions
sa4_df <- st_read('Data/1270055001_sa4_2016_aust_shape')
sa4_df_filter <- sa4_df |> filter(GCC_NAME16 == 'Greater Sydney')
```

```{r warning=FALSE}
#| label: fig-sa4-map
#| fig-cap: "Map of SA4s in Greater Sydney [@ABS2016], [@leaflet2023]"
# Popup text shown when a region is clicked
p_popup <- paste0("<strong>Name: </strong>", sa4_df_filter$SA4_NAME16)
leaflet(sa4_df_filter) %>%
  addPolygons(popup = p_popup,
              fillColor = '#fcbaa2',
              opacity = 1.0,
              weight = 2,
              color = "#d85f33",
              fillOpacity = 0.2) %>%
  addTiles()
```

## Data Cleaning

Data cleaning was done in R [@R2023] and RStudio [@rstudio] utilising the `tidyverse` packages [@tidyverse2019]. The `janitor` package [@janior2023] was initially used to help standardise the names of each column so that a reproducible introduction could be made. A new naming convention for the columns was adopted based on @tarr2023.
Some summary tables have also been created using `gt` [@gt2023].

<details>
<summary>Column Name Conversion Table</summary>

```{r message=FALSE}
#| label: fig-conversions
#| fig-cap: "Column Name Conversion Table [@tarr2023], [@gt2023]"
raw_df <- readr::read_csv('Data/DATA2x02 survey (2023) (Responses) - Form responses 1.csv')
old_names <- colnames(raw_df)
df <- raw_df
new_names <- c("timestamp", "n_units", "task_approach", "age",
               "life", "fass_unit", "fass_major", "novel",
               "library", "private_health", "sugar_days", "rent",
               "post_code", "haircut_days", "laptop_brand",
               "urinal_position", "stall_position", "n_weetbix", "food_budget",
               "pineapple", "living_arrangements", "height", "uni_travel_method",
               "feel_anxious", "study_hrs", "work", "social_media",
               "gender", "sleep_time", "diet", "random_number",
               "steak_preference", "dominant_hand", "normal_advanced", "exercise_hrs",
               "employment_hrs", "on_time", "used_r_before", "team_role",
               "social_media_hrs", "uni_year", "sport", "wam", "shoe_size")
colnames(df) <- new_names
# Side-by-side table of the new and original column names
name_combo <- dplyr::bind_cols(`New Names` = new_names, `Original Names` = old_names)
name_combo |>
  gt::gt() |>
  gt::tab_header(title = "Column Name Cleaning") |>
  gt::tab_options(heading.title.font.weight = 'bolder', column_labels.font.weight = 'bold')
```

</details>

The SA4 name of each respondent was joined to the survey data using a reference table made by @Proctor2023.
The HTML table on the website [@Proctor2023] was converted into a CSV file for easier manipulation [@csv2023].

```{r warning=FALSE, message=FALSE}
#| label: fig-sa4
#| fig-cap: "Students' SA4s Summary Table [@gt2023]"
# Build a postcode -> SA4 lookup; postcode 2232 is listed under two SA4s,
# so the Southern Highlands and Shoalhaven entry is dropped
sa4_postcode_df <- readr::read_csv('Data/sa4_postcode.csv') |>
  select(c(`Postcode`, `SA4 Name`)) |>
  unique() |>
  filter(!((`Postcode` == 2232) & (`SA4 Name` == 'Southern Highlands and Shoalhaven')))
colnames(sa4_postcode_df) <- c('post_code', 'sa4_name')
sa4_postcode_df$post_code <- as.character(sa4_postcode_df$post_code)
# Strip any non-digit characters from the free-text postcodes before joining
df$post_code <- as.character(gsub("[^0-9]", "", df$post_code))
df <- df |> left_join(sa4_postcode_df, by = "post_code")
df |>
  count(sa4_name) |>
  arrange(desc(n)) |>
  gt::gt() |>
  gt::cols_label(sa4_name = "SA4 Name", n = 'Count of Students') |>
  gt::tab_header(title = "Count of Students by SA4") |>
  gt::tab_options(heading.title.font.weight = 'bolder', column_labels.font.weight = 'bold')
```

The SA4s were further grouped together geographically to collapse some of the groups with lower student counts.
@fig-region-map is a map of the groupings of SA4s into regions.
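The postcode lookup join earlier in this section only behaves as a one-to-one mapping if each postcode appears once in the reference table; a postcode listed under two SA4s would duplicate that respondent after the join. The sketch below illustrates this on made-up data. The postcodes, the region labels and the helper name `check_unique_key` are hypothetical and are not taken from the survey or the reference table, and base R's `merge()` is used here in place of `left_join()`.

```r
# Toy lookup table: postcode 2000 is deliberately listed under two regions.
lookup <- data.frame(
  post_code = c("2000", "2000", "2010"),
  sa4_name  = c("Region A", "Region B", "Region A")
)

# Toy survey responses; free-text postcodes are stripped to digits first,
# mirroring the gsub() call used in the cleaning step.
survey <- data.frame(
  id        = 1:3,
  post_code = gsub("[^0-9]", "", c("2000", "NSW 2010", "2999"))
)

# Hypothetical helper: TRUE if every key value appears at most once.
check_unique_key <- function(tbl, key) {
  anyDuplicated(tbl[[key]]) == 0
}

# A left join (all.x = TRUE) keeps every survey row, but respondent 1 is
# duplicated because postcode 2000 matches two lookup rows.
joined_dup <- merge(survey, lookup, by = "post_code", all.x = TRUE)

# Dropping the ambiguous lookup row restores one row per respondent.
lookup_clean <- lookup[!(lookup$post_code == "2000" & lookup$sa4_name == "Region B"), ]
joined_clean <- merge(survey, lookup_clean, by = "post_code", all.x = TRUE)

check_unique_key(lookup, "post_code")
check_unique_key(lookup_clean, "post_code")
nrow(joined_dup)
nrow(joined_clean)
```

Checking the key before joining, and comparing row counts before and after, is a cheap way to confirm that no respondent was duplicated or dropped; unmatched postcodes simply receive an `NA` region.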
A conversion table was generated using `flextable` [@flextable2023].

<details>
<summary>SA4 to Region Conversion Table</summary>

```{r message=FALSE}
north_sydney = c('Sydney - North Sydney and Hornsby', 'Sydney - Ryde', 'Sydney - Northern Beaches')
city_and_eastern_suburbs = c('Sydney - City and Inner South', 'Sydney - Eastern Suburbs')
inner_west = c('Sydney - Inner West', 'Sydney - Parramatta', 'Sydney - Inner South West')
# Collapse SA4s into four broader regions; unmatched postcodes stay NA
df <- df |>
  mutate(geographic_regions = case_when(
    sa4_name %in% north_sydney ~ 'North Sydney',
    sa4_name %in% city_and_eastern_suburbs ~ 'City and Eastern Suburbs',
    sa4_name %in% inner_west ~ 'Inner West',
    !is.na(sa4_name) ~ 'Outer South West, Greater Sydney and Regional NSW',
    TRUE ~ NA_character_
  ))
mapping_df <- df |>
  select(geographic_regions, sa4_name) |>
  unique() |>
  drop_na() |>
  arrange(geographic_regions) |>
  mutate(`Region` = geographic_regions, `SA4 Name` = sa4_name) |>
  select(Region, `SA4 Name`)
flextable(mapping_df) |>
  merge_v() |>
  theme_vanilla() |>
  width(2, 4) |>
  width(1, 2)
```

</details>

```{r message=FALSE, warning=FALSE}
#| label: fig-region-map
#| fig-cap: "Map of SA4s grouped into Regions for students in DATA2X02 [@leaflet2023]"
# Keep only the SA4s that appear in the survey and apply the same grouping
sa4_df_in_survey <- sa4_df |> filter(SA4_NAME16 %in% df$sa4_name)
sa4_df_in_survey <- sa4_df_in_survey |>
  mutate(geographic_regions = case_when(
    SA4_NAME16 %in% north_sydney ~ 'North Sydney',
    SA4_NAME16 %in% city_and_eastern_suburbs ~ 'City and Eastern Suburbs',
    SA4_NAME16 %in% inner_west ~ 'Inner West',
    !is.na(SA4_NAME16) ~ 'Outer South West, Greater Sydney and Regional NSW',
    TRUE ~ NA_character_
  ))
factpal <- colorFactor(c('darkgreen', 'darkblue', 'darkred', 'purple'), sa4_df_in_survey$geographic_regions)
p_popup <- paste0("<strong>Name: </strong>", sa4_df_in_survey$SA4_NAME16)
leaflet(sa4_df_in_survey) |>
  addPolygons(popup = p_popup,
              fillColor = ~factpal(geographic_regions),
              opacity = 1.0,
              weight = 2,
              color = ~factpal(geographic_regions),
              fillOpacity = 0.1) |>
  addTiles() |>
  addLegend("bottomleft", pal = factpal, values = ~geographic_regions, title = 'Region')
```

<br>

A flag column was created to identify whether a respondent travelled to the university by car.

```{r}
#| label: fig-car
#| fig-cap: "Students' Method of Travel Summary Table [@gt2023]"
# "Drive" if the travel method mentions a car, "Other" otherwise, NA if unanswered
df <- df |>
  mutate(car_flag = ifelse(str_detect(uni_travel_method, "Car"), "Drive",
                           ifelse(is.na(uni_travel_method), NA, "Other")))
df |>
  count(car_flag) |>
  gt::gt() |>
  gt::cols_label(car_flag = "Does the Student Drive to University?", n = 'Count of Students') |>
  gt::tab_header(title = "Count of Students by Whether or Not they Travel by Car") |>
  gt::tab_options(heading.title.font.weight = 'bolder', column_labels.font.weight = 'bold')
```

<br>

The employment hours of each respondent were binned into categories of $0$ hours, $1-10$ hours, and $11+$ hours. This gives a more quantitative view of a respondent's employment than their employment status alone, since two respondents with the same status may work very different hours. Moreover, bins were used as a near majority (`r scales::percent(length(filter(df, employment_hrs == 0)$employment_hrs)/length(df$employment_hrs))`) of respondents worked no hours, which skewed the means of groups towards $0$.

```{r}
#| label: fig-hours
#| fig-cap: "Students' Employment Hours Summary Table [@gt2023]"
# Intervals are (-Inf, 0], (0, 10.5] and (10.5, Inf), so that only respondents
# who work no hours fall in the "0" bin (breaks of 0 and 1 would place a
# respondent working exactly 1 hour in the "0" bin)
bin_ranges <- c(-Inf, 0, 10.5, Inf)
bin_labels <- c("0", "1-10", "11+")
# Create a new column with binned values
df$employment_hrs_bin <- cut(df$employment_hrs, breaks = bin_ranges, labels = bin_labels)
df |>
  count(employment_hrs_bin) |>
  gt::gt() |>
  gt::cols_label(employment_hrs_bin = "Employment Hours", n = 'Count of Students') |>
  gt::tab_header(title = "Count of Students by Employment Hours") |>
  gt::tab_options(heading.title.font.weight = 'bolder', column_labels.font.weight = 'bold')
```

<br>

Outlying WAM values were set to `NA`, as these may come from international students who use a different grading system or from respondents who do not know their WAM.
A value was treated as an outlier if it was more than $3$ standard deviations from the mean, which is a common rule for removing outliers. @fig-wam-hist shows a histogram of respondents' WAM.

```{r warning=FALSE}
#| label: fig-wam-hist
#| fig-cap: "Histogram of students' WAM with outliers removed [@ggplot]"
# Set values more than 3 standard deviations from the mean to NA
remove_outlier <- function(vec) {
  centre <- mean(vec, na.rm = TRUE)
  spread <- sd(vec, na.rm = TRUE)
  vec[which(abs(vec - centre) > 3 * spread)] <- NA
  return(vec)
}
df[['wam']] <- remove_outlier(df[['wam']])
df %>%
  ggplot(aes(x = wam)) +
  geom_histogram(bins = 20, fill = "#fcbaa2", color = "#d85f33") +
  labs(x = "WAM", y = "Frequency", title = "Histogram of Students' WAM with Outliers Removed") +
  theme(legend.position = "none",
        plot.background = element_rect(fill = "#ffffff", linewidth = 0),
        axis.title = element_text(face = "bold"),
        plot.title = element_text(face = "bold", size = 13, hjust = 0.5))
```

<br>

In all hypothesis tests, the `NA` values of the columns investigated were removed, as these represented either questions left unanswered by the respondent or values deemed to be outliers.

# Hypothesis Testing

## Does living in Sydney's City and Eastern Suburbs influence if students drive to university?

As the University of Sydney is located in Sydney's City and Eastern Suburbs, it is suspected that students who live close to the university may opt for public transport rather than driving.
This is of interest as the effective carbon emissions of the University could be reduced if more students used public transport.
A relationship between region and travel method is also suggested by the EDA, which showed uneven proportions of drivers between the two groups, as seen in @fig-drive_graph.

```{r}
#| label: fig-drive_graph
#| fig-cap: "Proportion bar chart of travel method for different regions [@ggplot]"
# Compare City and Eastern Suburbs against all other regions combined
car_df <- df |>
  select(c(geographic_regions, car_flag)) |>
  mutate(geographic_regions = ifelse(geographic_regions == 'City and Eastern Suburbs', 'City and Eastern Suburbs', 'Other')) |>
  drop_na() |>
  mutate(`Travel Method` = car_flag)
car_df |>
  ggplot() +
  aes(x = geographic_regions, fill = `Travel Method`) +
  geom_bar(colour = "black", linewidth = 0.5, position = "fill") +
  labs(y = "Proportion of Travel Method",
       x = "Region",
       title = "Proportion of Students who drive to University \n based on Geographical Location",
       fill = "Travel Method") +
  theme(plot.background = element_rect(fill = "#ffffff", linewidth = 0),
        legend.background = element_rect(fill = "#ffffff", linewidth = 0),
        panel.border = element_rect(colour = "black", fill = NA),
        legend.box.background = element_rect(colour = "black"),
        axis.title = element_text(face = "bold"),
        plot.title = element_text(face = "bold", size = 14, hjust = 0.5)) +
  scale_y_continuous(labels = scales::percent) +
  scale_fill_brewer(palette = "Set2")
```

A $\chi^2$-test for independence was performed at the $\alpha = 0.05$ level on the below contingency table.
A Monte Carlo simulation of size $6000$ was used to approximate the null distribution of the test statistic and to calculate the $p$-value.

```{r}
#| label: fig-car_cong
#| fig-cap: "Contingency Table of Region and Method of Travel [@gt2023]"
contingency_table <- table(car_df$geographic_regions, car_df$car_flag) |>
  as.data.frame.matrix()
contingency_table$`Region` = c('City and Eastern Suburbs', 'Other')
contingency_table |>
  gt::gt() |>
  gt::cols_move_to_start(columns = c(`Region`)) |>
  gt::tab_spanner(label = "Method of Travel", columns = 1:2) |>
  gt::tab_header(title = "Count of Students by Method of Travel") |>
  gt::tab_options(heading.title.font.weight = 'bolder', column_labels.font.weight = 'bold')
```

```{r}
# Fix the seed so the simulated p-value is reproducible
set.seed(1)
test <- chisq.test(table(car_df$car_flag, car_df$geographic_regions),
                   simulate.p.value = TRUE, B = 6000)
```

::: {.callout-note icon=false}
## $\chi^2$-test for independence

1. **Hypothesis** -- $H_0$: The method of travel of a student is independent of whether or not they live in Sydney's City and Eastern Suburbs.
   $H_1$: There is some dependence between a student's method of travel and whether or not they live in Sydney's City and Eastern Suburbs.
2. **Assumptions** -- The observations are independent, and the simulated test statistics provide a good approximation to the distribution of the test statistic under $H_0$. The Monte Carlo approach does not rely on the large-sample $\chi^2$ approximation, so no minimum expected cell count is required; instead, tables are simulated under $H_0$ with the observed row and column totals held fixed, and the approximation improves as the number of simulations grows.
   Despite some concerns raised in @sec-rs, the observations are likely to be independent from each other.
3. **Test Statistic** -- $$T = \sum_{i=1}^2 \sum_{j=1}^2 \frac{\left(Y_{i j}-e_{i j}\right)^2}{e_{i j}}$$
4. **Observed Test Statistic** -- $t_0=$ `r round(unname(test$statistic), 2)`.
5. **p-value** -- The proportion of simulated test statistics that were as or more extreme than $t_0$ was $p=$ `r as.character(round(test$p.value, 5))`.
6. **Decision** -- As the $p$-value was $<\alpha$, we reject $H_0$. This implies that there is some dependence between the method of travel and whether or not a student lives in Sydney's City and Eastern Suburbs.
:::

## Is academic performance significantly better for students living in North Sydney compared to those in the Inner West?

A student's [[WAM]{style="text-decoration: underline;"}]{id='wam'} is one measure of academic performance. Knowing if WAM is impacted by where students live could be useful, as it could allow the University to provide targeted academic support.

```{r}
# WAMs of the two regions being compared, with missing values removed
wam_df <- df |>
  filter(geographic_regions %in% c('Inner West', 'North Sydney')) |>
  select(geographic_regions, wam) |>
  drop_na()
inner_west_wam <- filter(wam_df, geographic_regions == "Inner West")$wam
north_sydney_wam <- filter(wam_df, geographic_regions == "North Sydney")$wam
```

A Welch two-sample one-sided $t$-test at the $\alpha = 0.05$ level was conducted to determine if the mean WAM of students in North Sydney is larger than that of students living in the Inner West. This test is less powerful than a classical two-sample $t$-test when the variances are equal, but forgoes the assumption of equal variance. It was chosen because `North Sydney` has a variance of `r round(var(north_sydney_wam),1)` compared to `Inner West` having `r round(var(inner_west_wam),1)`.
Initial EDA suggests the mean WAM may be larger in North Sydney, with mean WAMs of `r round(mean(north_sydney_wam),1)` and `r round(mean(inner_west_wam),1)` for North Sydney and the Inner West respectively.
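To make the Welch test described above concrete, the sketch below computes the test statistic and the Welch-Satterthwaite degrees of freedom by hand and compares them with `t.test()`. The two vectors are simulated purely for illustration (they are not survey data), and `welch_by_hand` is a hypothetical helper name.

```r
set.seed(2023)
# Simulated marks for two groups with unequal spread (illustrative only).
group_a <- rnorm(40, mean = 75, sd = 8)
group_b <- rnorm(55, mean = 73, sd = 12)

# Hypothetical helper: one-sided Welch test of mean(x) > mean(y).
welch_by_hand <- function(x, y) {
  vx <- var(x) / length(x)   # squared standard error of mean(x)
  vy <- var(y) / length(y)   # squared standard error of mean(y)
  t0 <- (mean(x) - mean(y)) / sqrt(vx + vy)
  # Welch-Satterthwaite approximation to the degrees of freedom.
  nu <- (vx + vy)^2 / (vx^2 / (length(x) - 1) + vy^2 / (length(y) - 1))
  # Upper-tail p-value, matching alternative = "greater".
  p <- pt(t0, df = nu, lower.tail = FALSE)
  c(statistic = t0, df = nu, p.value = p)
}

by_hand  <- welch_by_hand(group_a, group_b)
built_in <- t.test(group_a, group_b, alternative = "greater")

# The two approaches should agree up to floating point error.
all.equal(unname(by_hand["statistic"]), unname(built_in$statistic))
all.equal(unname(by_hand["df"]), unname(built_in$parameter))
all.equal(unname(by_hand["p.value"]), built_in$p.value)
```

Because the degrees of freedom are estimated from the two sample variances, they are generally not an integer, which is why the report rounds the value returned by `t.test()`.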
A QQ-plot of students' WAM (@fig-box-plot-2) shows that the variable is plausibly normally distributed within each group, as the points lie close to the reference line.

```{r}
#| layout-nrow: 1
#| label: fig-box-plot
#| fig-cap: 
#|   - "Box plot of WAMs of students from the Inner West and North Sydney [@ggplot]"
#|   - "QQ-plot of WAMs of students from the Inner West and North Sydney [@ggpubr]"
wam_df |>
  ggplot() +
  aes(y = wam, color = geographic_regions, x = geographic_regions, fill = geographic_regions) +
  geom_boxplot() +
  geom_jitter(width = 0.05, size = 1) +
  scale_color_manual(values = c('darkblue', 'darkred')) +
  scale_fill_manual(values = c(rgb(214/255, 214/255, 232/255), rgb(230/255, 215/255, 214/255))) +
  theme(legend.position = "none") +
  theme(plot.background = element_rect(fill = "#ffffff", linewidth = 0),
        panel.border = element_rect(colour = "black", fill = NA),
        legend.box.background = element_rect(colour = "black"),
        axis.title = element_text(face = "bold"),
        plot.title = element_text(face = "bold", size = 14, hjust = 0.5)) +
  labs(y = "WAM", x = "Region", title = "A: Grouped Box Plot of WAM by Region")
ggqqplot(wam_df, x = "wam", facet.by = "geographic_regions", color = "geographic_regions",
         palette = c('darkblue', 'darkred'), legend = 'none', title = "B: QQ-plot of WAM") +
  theme(plot.background = element_rect(fill = "#ffffff", linewidth = 0),
        panel.border = element_rect(colour = "black", fill = NA),
        legend.box.background = element_rect(colour = "black"),
        axis.title = element_text(face = "bold"),
        plot.title = element_text(face = "bold", size = 14, hjust = 0.5))
# Welch test (t.test() does not assume equal variances by default)
test <- t.test(north_sydney_wam, inner_west_wam, alternative = 'greater')
# Shapiro-Wilk normality test for each group
shapiro1 <- shapiro.test((wam_df |> filter(geographic_regions == 'Inner West'))$wam)
shapiro2 <- shapiro.test((wam_df |> filter(geographic_regions == 'North Sydney'))$wam)
degrees_of_freedom <- test$parameter
```

```{r fig.width=10}
#| label: fig-hist
#| fig-cap: "Histogram of WAMs of students from the Inner West and North Sydney [@ggplot]"
wam_df |>
  ggplot() +
  aes(color = geographic_regions, x = wam, fill = geographic_regions) +
  geom_histogram(bins = 15) +
  facet_wrap(~geographic_regions, scales = "free_y") +
  scale_color_manual(values = c('darkblue', 'darkred')) +
  scale_fill_manual(values = c(rgb(214/255, 214/255, 232/255), rgb(230/255, 215/255, 214/255))) +
  labs(title = "C: Histogram of WAM based on Region", x = "WAM", y = "Frequency") +
  theme(plot.background = element_rect(fill = "#ffffff", linewidth = 0),
        legend.background = element_rect(fill = "#ffffff", linewidth = 0),
        panel.border = element_rect(colour = "black", fill = NA),
        legend.box.background = element_rect(colour = "black"),
        axis.title = element_text(face = "bold", size = 16),
        axis.text = element_text(size = 14),
        legend.text = element_text(size = 12),
        plot.title = element_text(face = "bold", size = 18, hjust = 0.5),
        legend.position = "none")
```

<br>

::: {.callout-note icon=false}
## Welch two-sample one-sided $t$-test

1. **Hypothesis** -- $H_0$: The mean WAM of students from North Sydney $\mu_{NS}$ equals the mean WAM of students from the Inner West $\mu_{IW}$. $H_1$: $\mu_{NS}$ is greater than $\mu_{IW}$.
2. **Assumptions** -- The observations within each group are independently and identically distributed as $\mathcal{N}(\mu_{i}, \sigma_{i}^2)$ for $i=NS, IW$, and the two groups are independent of each other.
   Despite some concerns raised in @sec-rs, the groups are likely to be independent from each other. A student may have responded twice with different values of WAM or Postcode, which could make the groups not independent.
   The above QQ-plot (@fig-box-plot-2) shows that the WAM is likely to be normally distributed. Moreover, using a Shapiro-Wilk test, both groups were consistent with $X\sim\mathcal{N}(\mu_{i}, \sigma_{i}^2)$, with $p$-values of `r round(shapiro1$p.value, 3)` for `Inner West` and `r round(shapiro2$p.value, 3)` for `North Sydney`.
3. **Test Statistic** -- $$T=\frac{\bar{X}_{NS}-\bar{X}_{IW}}{\sqrt{\frac{S_{NS}^2}{n_{NS}}+\frac{S_{IW}^2}{n_{IW}}}}$$
   Here, $\bar{X}_{i}$, $S_{i}^2$ and $n_{i}$ are the sample mean, sample variance and sample size of the $NS$ (North Sydney) and $IW$ (Inner West) samples.
   Under $H_0$, $T\sim t_{\nu}$ approximately, where $\nu=$ `r round(degrees_of_freedom,2)` as estimated from the data.
4. **Observed Test Statistic** -- $t_0=$ `r round(unname(test$statistic), 2)`
5. **p-value** -- $p = P\left(t_\nu \geq t_0\right)=$ `r as.character(round(test$p.value, 3))`
6. **Decision** -- As the $p$-value was $>\alpha$, we do not reject $H_0$; the data are consistent with the mean WAM of students in North Sydney and the Inner West being equal.
:::

## Does a student's Region have a significant influence on how many hours they work?

Initial exploration of the data set suggested that the distribution of working hours differed across regions. The proportion of students working between 1 and 10 hours per week was relatively similar, and the main differences were observed when comparing the proportion of students working no hours or 11 or more hours a week.

```{r}
#| label: fig-hours_graph
#| fig-cap: "Proportion bar chart of hours worked for different regions [@ggplot]"
employment_df <- df |>
  select(c(geographic_regions, employment_hrs_bin)) |>
  drop_na() |>
  mutate(`Employment Hours per Week` = employment_hrs_bin)
employment_df |>
  ggplot() +
  aes(x = geographic_regions, fill = `Employment Hours per Week`) +
  geom_bar(colour = "black", linewidth = 0.5, position = "fill") +
  labs(y = "Proportion of Hours Worked Category",
       x = "Region",
       title = "Proportion of Students in Hours Worked by Region",
       fill = "Employment Hours per Week") +
  theme(plot.background = element_rect(fill = "#ffffff", linewidth = 0),
        legend.background = element_rect(fill = "#ffffff", linewidth = 0),
        panel.border = element_rect(colour = "black", fill = NA),
        legend.box.background = element_rect(colour = "black"),
        axis.title = element_text(face = "bold"),
        plot.title = element_text(face = "bold", size = 14, hjust = 0.5)) +
  scale_y_continuous(labels = scales::percent) +
  scale_x_discrete(labels = c("City and \n Eastern Suburbs", "Inner West", "North Sydney", "Outer South West, \n Greater Sydney and \n Regional NSW")) +
  scale_fill_brewer(palette = "Set2")
```

A $\chi^2$-test for independence was performed at the $\alpha = 0.05$ level on the below contingency table.
No continuity correction was applied, as `chisq.test` only applies Yates's correction to $2 \times 2$ tables.

```{r}
#| label: fig-car_hours_worked
#| fig-cap: "Contingency Table of Region and Hours Worked [@gt2023]"
contingency_table <- table(employment_df$geographic_regions, employment_df$employment_hrs_bin) |>
  as.data.frame.matrix()
contingency_table$`Region` = c('City and Eastern Suburbs', 'Inner West', 'North Sydney', 'Outer South West,\n Greater Sydney and \n Regional NSW')
contingency_table |>
  gt::gt() |>
  gt::cols_move_to_start(columns = c(`Region`)) |>
  gt::tab_spanner(label = "Hours Worked", columns = 1:3) |>
  gt::tab_header(title = "Count of Students by Hours Worked") |>
  gt::tab_options(heading.title.font.weight = 'bolder', column_labels.font.weight = 'bold')
# Rows are the four regions, columns are the three employment-hours bins
test <- chisq.test(table(employment_df$geographic_regions, employment_df$employment_hrs_bin))
degrees_of_freedom <- test$parameter
```

::: {.callout-note icon=false}
## $\chi^2$-test for independence

1. **Hypothesis** -- $H_0$: The number of hours worked by a student is independent of their region. $H_1$: There is some dependence between the number of hours worked and region.
2. **Assumptions** -- The observations are independent, and the expected cell counts are greater than or equal to 5.
   Despite some concerns raised in @sec-rs, the observations are likely to be independent from each other.
   This is, however, a limitation of the test, as it is hard to verify this assumption absolutely. There were `r numbers_to_words(sum(test$expected < 5))` expected cell counts less than 5, so the expected cell count assumption holds.
3. **Test Statistic** -- $$T = \sum_{i=1}^4 \sum_{j=1}^3 \frac{\left(Y_{i j}-e_{i j}\right)^2}{e_{i j}}$$
   Here $i$ indexes the four regions and $j$ the three employment-hours bins. Under $H_0$, $T\sim \chi^2_{`r degrees_of_freedom`}$ approximately.
4. **Observed Test Statistic** -- $t_0=$ `r round(unname(test$statistic), 2)`
5. **p-value** -- $p=P(\chi^2_{`r degrees_of_freedom`} \geq t_0)<0.0001$.
6. **Decision** -- As the $p$-value was $<\alpha$, we reject $H_0$. This implies that there is some dependence between hours worked in a week and a student's region.
:::

# Conclusion

This report investigated the geographic characteristics of DATA2X02 students by grouping them into regions and performing hypothesis tests on several variables.
Throughout the analysis, geographic region was found to have a statistically significant association with `Travel Method` and `Employment Hours per Week`. Specifically, the method of travel of students depends on whether or not they live in Sydney's City and Eastern Suburbs, and employment hours per week depend on region. There was no statistically significant evidence to suggest that the mean WAM of students from North Sydney is greater than that of students living in the Inner West.

Future investigation into DATA2X02 cohorts may look to validate these results (to see if they are consistent across all DATA2X02 cohorts or specific to the 2023 cohort), as well as source more specific geographical information about students, rather than using their postcode. An example of this could be using a respondent's address to find the straight-line distance from their residence to the University, and investigating how this may influence students' answers. If this were to happen, stricter data security would be required, as a respondent's address could be considered too personal to be released to every DATA2X02 student.
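As a minimal sketch of how the straight-line distance suggested above might be computed, the function below uses the haversine formula for great-circle distance. The function name `haversine_km`, the mean Earth radius of 6371 km and both sets of coordinates are assumptions made for illustration: the campus location is approximate and the home location is made up.

```r
# Great-circle distance in kilometres between two points given in decimal degrees.
# radius_km is an assumed mean Earth radius.
haversine_km <- function(lat1, lon1, lat2, lon2, radius_km = 6371) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  # pmin() guards against rounding pushing sqrt(a) slightly above 1.
  2 * radius_km * asin(pmin(1, sqrt(a)))
}

# Approximate campus coordinates (assumed) and a made-up home location.
uni_lat  <- -33.8886
uni_lon  <- 151.1873
home_lat <- -33.8150
home_lon <- 151.0011

distance_to_uni <- haversine_km(home_lat, home_lon, uni_lat, uni_lon)
distance_to_uni
```

The function is vectorised, so it could be applied to a whole column of geocoded addresses at once. Only the derived distance, rather than the address itself, would then need to be shared with the cohort.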